perf: optimize Hunyuan DiT Ulysses and non-attention paths #1200
starrkk wants to merge 6 commits into
(cherry picked from commit 8f06fb6c7e0859f432a329a84f8d5d8e3a386ad1)
Support split image/text QKV inputs, optional split attention outputs, async text all_gather, and profiler ranges for Ulysses sequence-parallel attention. (cherry picked from commit 8bb7c3e1784140a8f6d372fe429b468e3a502b8b)
Reuse dynamic activation quantization across consecutive Q/K/V projections and route split image/text tensors through the Ulysses attention path when enabled. (cherry picked from commit 61c5df5c20106254d5294b910cdf3d1780970a97)
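The quantization reuse described in this commit message can be illustrated with a minimal sketch. It assumes symmetric per-token int8 dynamic quantization of the activations, float weights, and plain Python lists in place of tensors; `quantize_per_token`, `int8_linear`, and `qkv_shared_quant` are hypothetical names, not the functions touched by this PR.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_per_token(x: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one scale per row (token)."""
    q_rows, scales = [], []
    for row in x:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0  # avoid division by zero
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def int8_linear(q_x: List[List[int]], scales: List[float], w: Matrix) -> Matrix:
    """Multiply quantized activations by w (in_features x out_features),
    then dequantize each output row with that row's scale."""
    out_features = len(w[0])
    return [
        [scale * sum(q * w[i][j] for i, q in enumerate(q_row))
         for j in range(out_features)]
        for q_row, scale in zip(q_x, scales)
    ]


def qkv_shared_quant(x: Matrix, wq: Matrix, wk: Matrix, wv: Matrix):
    """Quantize the shared input once and reuse it for Q, K, and V."""
    q_x, scales = quantize_per_token(x)  # done once instead of three times
    return tuple(int8_linear(q_x, scales, w) for w in (wq, wk, wv))
```

The saving comes from the single `quantize_per_token` call: the three projections consume the same input, so the quantized activations and their scales are computed once.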
Code Review
This pull request introduces several performance optimizations for Ulysses attention and Hunyuan Video transformer inference, including support for split QKV inputs/outputs to avoid copy overhead, asynchronous text gathering, buffer reuse, shared dynamic activation quantization, and optional torch.compile support for non-attention branches. The reviewer identified several critical issues: potential NCCL hangs due to overlapping collective operations on the same process group, high compilation overhead and cache thrashing from compiling functions with custom weight objects, an unbounded memory leak in the text gather buffer cache under dynamic prompt lengths, incorrect text mask length calculation when using split QKV inputs, and a potential AttributeError in the shared quantization check if key/value weights lack the expected quantization methods.
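One common way to address the unbounded buffer cache the review mentions is to cap the number of cached buffers and evict the least recently used one. The sketch below is only an illustration of that idea: `BoundedBufferCache` is a hypothetical class, the key is assumed to be something like a (shape, dtype, device) tuple, and `bytearray` stands in for a tensor allocation.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class BoundedBufferCache:
    """LRU cache of reusable buffers with a fixed maximum number of entries."""

    def __init__(self, max_entries: int = 4) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._buffers: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable, allocate: Callable[[], Any]) -> Any:
        """Return the cached buffer for key, allocating it on a miss."""
        if key in self._buffers:
            self._buffers.move_to_end(key)  # mark as most recently used
            return self._buffers[key]
        buf = allocate()
        self._buffers[key] = buf
        if len(self._buffers) > self._max_entries:
            self._buffers.popitem(last=False)  # evict least recently used
        return buf

    def __len__(self) -> int:
        return len(self._buffers)


# Usage: prompt lengths vary per request, but at most two buffers are kept.
cache = BoundedBufferCache(max_entries=2)
for txt_len in (77, 120, 256, 77):
    buf = cache.get(("txt", txt_len), lambda n=txt_len: bytearray(n))
```

With a cap in place, memory held by the cache is bounded by the largest `max_entries` buffers rather than growing with every new prompt length.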
    if split_qkv_input:
        img_qkv_len = img_q.shape[0]
        txt_qkv_len = txt_q.shape[0]
        txt_mask_len = None
When split_qkv_input is True, txt_mask_len is unconditionally set to None. However, if len(cu_seqlens_qkv) == 3, there is a text mask, and txt_mask_len should be computed as cu_seqlens_qkv[2] - slice_qkv_len to ensure correctness for other models or configurations using this generic Ulysses attention implementation.
Suggested change:

      if split_qkv_input:
          img_qkv_len = img_q.shape[0]
          txt_qkv_len = txt_q.shape[0]
    -     txt_mask_len = None
    +     txt_mask_len = cu_seqlens_qkv[2] - slice_qkv_len if len(cu_seqlens_qkv) == 3 else None
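The suggested expression can be isolated as a small helper to make the two cases explicit. This is a sketch under assumptions: `compute_txt_mask_len` is a hypothetical name, `cu_seqlens_qkv` is treated as a plain sequence of cumulative lengths (the real code presumably uses a tensor), and the meaning of `slice_qkv_len` is taken from the review comment because its definition is not shown here. The numbers in the usage lines are illustrative only.

```python
from typing import Optional, Sequence


def compute_txt_mask_len(
    cu_seqlens_qkv: Sequence[int], slice_qkv_len: int
) -> Optional[int]:
    """Return the masked text length, or None when no text mask is present.

    Three cumulative boundaries are taken to mean that a text mask exists,
    following the condition used in the review suggestion.
    """
    if len(cu_seqlens_qkv) == 3:
        return cu_seqlens_qkv[2] - slice_qkv_len
    return None


# Illustrative values, not taken from the PR.
with_mask = compute_txt_mask_len([0, 1100, 1124], slice_qkv_len=1024)
without_mask = compute_txt_mask_len([0, 1100], slice_qkv_len=1024)
```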
Please do not include profiles in inference operators. Please do not use environment variables as switches.
Addressed in
Summary
- `torch.compile` wrappers controlled by model config
Why
These changes reduce Python/tensor layout overhead around HunyuanVideo DiT inference and let Hygon DCU deployments reuse activation quantization for consecutive Q/K/V projections. Runtime choices are now passed through config/API parameters instead of environment-variable switches.
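A minimal sketch of a config-driven switch, as opposed to an environment-variable one, might look like the following. `RuntimeConfig`, its field names, and `maybe_compile` are hypothetical and do not reflect the actual LightX2V config schema; only `torch.compile` is a real API, and it is imported lazily so the disabled path has no dependency on it.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class RuntimeConfig:
    """Hypothetical runtime options, passed explicitly by the caller."""

    compile_non_attention: bool = False
    split_qkv_input: bool = False


def maybe_compile(fn: Callable[..., Any], cfg: RuntimeConfig) -> Callable[..., Any]:
    """Wrap fn with torch.compile only when the config asks for it."""
    if not cfg.compile_non_attention:
        return fn  # default path: function is returned unchanged
    import torch  # imported lazily; only needed when compilation is enabled

    return torch.compile(fn)


def modulate(x: float, scale: float, shift: float) -> float:
    """Stand-in for a non-attention branch."""
    return x * (1.0 + scale) + shift


cfg = RuntimeConfig()  # compilation disabled unless the caller opts in
modulate_fn = maybe_compile(modulate, cfg)
```

Because the option travels with the config object, the behavior of a given call is visible at the call site and does not depend on process-wide state.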
Validation
- `ModelTC/LightX2V:main` (`89dfa833`)
- `ruff check --config=pyproject.toml` passed for the touched files
- `ruff format --check --config=pyproject.toml` passed for the touched files
- `python -m py_compile` passed for the touched files
- No `LIGHTX2V_*` env switches or profiler ranges remain in the touched operator files